Introduction

The objective of this project is to analyze the Hitters dataset using statistical and machine learning techniques in order to predict the salary level of major league baseball players. The main focus is on applying and evaluating a multinomial logistic regression model, using a wide set of performance statistics as predictors. The salary variable, originally quantitative, is transformed into a categorical variable by binning it at its quartiles. By implementing proper data preprocessing, visualization, model fitting, and cross-validation strategies, we aim to identify the most accurate and interpretable model. Finally, we evaluate the predictive power of the full model against reduced models, while also gaining insight into the key performance metrics that drive salary differences in professional baseball.

library(tidyverse)
library(nnet)
library(tidyr)
library(ggplot2)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(DT)
library(skimr)
library(tibble)
library(e1071)
library(patchwork)
library(car)
library(ggthemes)
library(scales)
library(viridis)
library(ggrepel)
library(glmnet)
library(reshape2)
library(plotly)

Loading the data

Hitters <- read.csv("Hitters.csv")

EDA

General Overview

In this section, we examine the overall structure of the Hitters dataset. The code builds a summary table of the variable names, their data types, and sample values using the tibble, sapply, and head functions, which helps us quickly understand what kind of data we're working with. Additionally, we use the skim() function from the skimr package to produce an in-depth summary of the dataset, including the number of rows, column types, missing values, and basic statistics such as the mean, standard deviation, and quartiles for each variable.

structure_table <- tibble(
  Variable = names(Hitters),
  Type = sapply(Hitters, function(x) class(x)[1]),
  Example_Values = sapply(Hitters, function(x) paste(head(x, 3), collapse = ", "))
)

knitr::kable(structure_table, caption = "Structure of the Hitters Dataset")
Structure of the Hitters Dataset
Variable Type Example_Values
AtBat integer 293, 315, 479
Hits integer 66, 81, 130
HmRun integer 1, 7, 18
Runs integer 30, 24, 66
RBI integer 29, 38, 72
Walks integer 14, 39, 76
Years integer 1, 14, 3
CAtBat integer 293, 3449, 1624
CHits integer 66, 835, 457
CHmRun integer 1, 69, 63
CRuns integer 30, 321, 224
CRBI integer 29, 414, 266
CWalks integer 14, 375, 263
League character A, N, A
Division character E, W, W
PutOuts integer 446, 632, 880
Assists integer 33, 43, 82
Errors integer 20, 10, 14
Salary numeric NA, 475, 480
NewLeague character A, N, A
skim(Hitters)
Data summary
Name Hitters
Number of rows 317
Number of columns 20
_______________________
Column type frequency:
character 3
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
League 0 1 1 1 0 2 0
Division 0 1 1 1 0 2 0
NewLeague 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AtBat 0 1.00 381.23 153.35 16.0 256 379 512 687 ▁▇▇▇▅
Hits 0 1.00 101.14 46.49 1.0 64 96 137 238 ▂▇▆▃▁
HmRun 0 1.00 10.75 8.74 0.0 4 8 16 40 ▇▃▂▁▁
Runs 0 1.00 50.89 26.02 0.0 30 48 69 130 ▅▇▆▃▁
RBI 0 1.00 47.95 26.07 0.0 28 44 63 121 ▃▇▃▃▁
Walks 0 1.00 38.51 21.33 0.0 22 35 53 105 ▅▇▅▂▁
Years 0 1.00 7.43 4.93 1.0 4 6 11 24 ▇▅▂▂▁
CAtBat 0 1.00 2646.22 2327.24 19.0 822 1928 3919 14053 ▇▃▂▁▁
CHits 0 1.00 717.31 655.93 4.0 209 506 1051 4256 ▇▃▁▁▁
CHmRun 0 1.00 68.94 86.03 0.0 14 37 90 548 ▇▁▁▁▁
CRuns 0 1.00 358.15 334.22 1.0 101 247 518 2165 ▇▂▁▁▁
CRBI 0 1.00 328.83 332.82 0.0 91 219 421 1659 ▇▂▁▁▁
CWalks 0 1.00 258.01 264.74 0.0 68 170 337 1566 ▇▂▁▁▁
PutOuts 0 1.00 290.54 282.31 0.0 109 212 325 1378 ▇▃▁▁▁
Assists 0 1.00 107.94 137.44 0.0 7 40 166 492 ▇▂▁▁▁
Errors 0 1.00 8.09 6.39 0.0 3 6 11 32 ▇▅▂▁▁
Salary 58 0.82 532.81 451.64 67.5 190 420 750 2460 ▇▃▁▁▁

Interpretation

Overview of the Categorical Data

cat_vars <- Hitters %>% select_if(is.character)

plots <- lapply(names(cat_vars), function(var) {
  ggplot(Hitters, aes(x = .data[[var]])) +
    geom_bar(fill = "#914777", color = "black") +
    labs(title = paste("Barplot of", var),
         x = var, y = "Count") +
    theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      axis.text.x = element_text(angle = 0)
    )
})
wrap_plots(plots, nrow = 1)

Interpretation

Checking Data Integrity

num_duplicates <- sum(duplicated(Hitters))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0
cat("Missing values before removal:\n\n")
## Missing values before removal:
missing_before <- colSums(is.na(Hitters))
print(missing_before[missing_before > 0])
## Salary 
##     58
Hitters <- na.omit(Hitters)

cat("\n Missing values after removal:\n\n")
## 
##  Missing values after removal:
missing_after <- colSums(is.na(Hitters))
print(missing_after[missing_after > 0])
## named numeric(0)
cat("\n New dataset dimensions:\n")
## 
##  New dataset dimensions:
print(dim(Hitters))
## [1] 259  20

Interpretation

Encoding Categorical Variables

This code converts the three categorical variables (League, Division, and NewLeague) from character type to numeric codes using as.factor() followed by as.numeric(). This is necessary because matrix-based methods used later, such as glmnet, require numeric input rather than text or factor labels.

Hitters$League <- as.numeric(as.factor(Hitters$League))
Hitters$Division <- as.numeric(as.factor(Hitters$Division))
Hitters$NewLeague <- as.numeric(as.factor(Hitters$NewLeague))

str(Hitters[, c("League", "Division", "NewLeague")])
## 'data.frame':    259 obs. of  3 variables:
##  $ League   : num  2 1 2 2 1 2 1 2 1 2 ...
##  $ Division : num  2 2 1 1 2 1 2 2 1 2 ...
##  $ NewLeague: num  2 1 2 2 1 1 1 2 1 2 ...
unique(Hitters$League)
## [1] 2 1
unique(Hitters$Division)
## [1] 2 1
unique(Hitters$NewLeague)
## [1] 2 1

Interpretation * After encoding, each category is represented by either 1 or 2; for example, League levels A and N become 1 and 2, respectively. This transformation ensures that the categorical variables can be used in the regression model without causing errors.

Assessing Skewness

This chunk identifies skewed numerical variables using the skewness() function and applies a log transformation to reduce the skewness.

A threshold of absolute skewness > 1 was used to define which variables needed transformation.

We visualize the distributions before and after the transformation for comparison. Then, the code overwrites the original variables with their log-transformed versions and summarizes the skewness reduction in a comparison table.

This step is important to improve model performance and meet assumptions of algorithms that expect more normally distributed input.

numeric_vars <- Hitters[, sapply(Hitters, is.numeric)]

skew_vals <- sapply(numeric_vars, function(x) skewness(x, na.rm = TRUE))

skewed_vars <- names(skew_vals[abs(skew_vals) > 1]) 

par(mfrow = c(2,2))
for (var in head(skewed_vars, 4)) {
  hist(Hitters[[var]], main = paste("Original:", var), col = "#c268b7", xlab = var)
  hist(log(Hitters[[var]] + 1), main = paste("Log-Transformed:", var), col = "#76226b", xlab = paste("log(", var, ")"))
}

par(mfrow = c(1,1))

for (var in skewed_vars){
  if (all(Hitters[[var]] >= 0)) {
    Hitters[[var]] <- log(Hitters[[var]] + 1)
  }
}

post_skew <- sapply(Hitters[skewed_vars], function(x) skewness(x, na.rm = TRUE))

Skew_Before <- round(skew_vals[skewed_vars], 2)
Skew_After <- round(post_skew, 2)

skew_table <- data.frame(
  Variable = skewed_vars,
  Skew_Before = Skew_Before,
  Skew_After = Skew_After
)

knitr::kable(skew_table, row.names = FALSE, caption = "Skewness Before and After Log Transformation")
Skewness Before and After Log Transformation
Variable Skew_Before Skew_After
CAtBat 1.32 -0.68
CHits 1.46 -0.65
CHmRun 2.23 -0.44
CRuns 1.53 -0.62
CRBI 1.55 -0.49
CWalks 1.90 -0.51
PutOuts 2.04 -2.22
Assists 1.14 -0.29
Salary 1.60 -0.16

Interpretation

Outlier Handling

This code uses ggplot2 to create faceted boxplots for all numeric variables in the Hitters dataset.

It reshapes the data into long format and displays each variable in its own panel using facet_wrap().

Outliers are highlighted in blue to make them visually stand out. This is helpful for quickly identifying extreme values across all features in one compact view.

Hitters_long <- Hitters %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(Hitters_long, aes(x = "", y = Value, fill = Variable)) +
  geom_boxplot(outlier.shape = 21, outlier.color = "blue", outlier.fill = "white") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  labs(title = "Boxplots of Variables (Faceted)",
       x = "", y = "Value") +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 10),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Interpretation

Here we detect and remove outliers from the numeric variables in the Hitters dataset using the IQR (interquartile range) method.

A custom function identifies row indices where values fall outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], a common threshold for defining outliers.

This function is applied across all numeric columns, and all corresponding rows with outliers are collected and removed. The final output prints how many rows were removed and updates the dataset to reflect its new size.

get_outlier_indices <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  which(x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR))
}

numeric_cols <- sapply(Hitters, is.numeric)
Hitters_numeric <- Hitters[, numeric_cols]

outlier_indices <- unique(unlist(lapply(Hitters_numeric, get_outlier_indices)))

cat("Total outliers removed:", length(outlier_indices), "\n")
## Total outliers removed: 21
Hitters <- Hitters[-outlier_indices, ]

cat("New dataset size:", dim(Hitters)[1], "rows and", dim(Hitters)[2], "columns\n")
## New dataset size: 238 rows and 20 columns
Hitters_long_clean <- Hitters %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(Hitters_long_clean, aes(x = "", y = Value, fill = Variable)) +
  geom_boxplot(outlier.shape = 21, outlier.color = "blue", outlier.fill = "white") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  labs(title = "Boxplots of Variables After Outlier Removal (Faceted)",
       x = "", y = "Value") +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 10),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Interpretation

Scaling Numerical Predictors

In this step, we scale all numeric predictors in the dataset using standardization (z-score scaling), which centers each variable at a mean of 0 with a standard deviation of 1.

We exclude categorical variables and the target variable Salary from this process, as they should not be scaled. This scaling ensures that all features contribute equally to distance-based models and avoids dominance by variables with larger numeric ranges.

numeric_cols <- sapply(Hitters, is.numeric)

exclude_cols <- c("League", "Division", "NewLeague", "Salary")
numeric_cols[names(numeric_cols) %in% exclude_cols] <- FALSE

Hitters_scaled <- Hitters
Hitters_scaled[, numeric_cols] <- scale(Hitters[, numeric_cols])
Summary of Scaled Variables
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
Min. :-1.94928 Min. :-1.84561 Min. :-1.3445 Min. :-1.8892 Min. :-1.7057 Min. :-1.8198 Min. :-1.3575 Min. :-2.3617 Min. :-2.3277 Min. :-2.39113 Min. :-2.56997 Min. :-2.7486 Min. :-2.7933 Min. :1.000 Min. :1.0 Min. :-2.54420 Min. :-2.24069 Min. :-1.4023 Min. :4.227 Min. :1.000
1st Qu.:-0.86658 1st Qu.:-0.81352 1st Qu.:-0.7565 1st Qu.:-0.8564 1st Qu.:-0.8330 1st Qu.:-0.8166 1st Qu.:-0.6881 1st Qu.:-0.6702 1st Qu.:-0.7236 1st Qu.:-0.65527 1st Qu.:-0.71556 1st Qu.:-0.6585 1st Qu.:-0.6853 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:-0.63300 1st Qu.:-0.91467 1st Qu.:-0.7672 1st Qu.:5.254 1st Qu.:1.000
Median : 0.05062 Median :-0.03807 Median :-0.2861 Median :-0.1016 Median :-0.1956 Median :-0.1955 Median :-0.2419 Median : 0.1086 Median : 0.1054 Median : 0.09905 Median : 0.09282 Median : 0.1291 Median : 0.1066 Median :1.000 Median :1.5 Median : 0.03888 Median : 0.07404 Median :-0.2909 Median :6.039 Median :1.000
Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean :1.475 Mean :1.5 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean :5.924 Mean :1.471
3rd Qu.: 0.82817 3rd Qu.: 0.75413 3rd Qu.: 0.6547 3rd Qu.: 0.7623 3rd Qu.: 0.7947 3rd Qu.: 0.7480 3rd Qu.: 0.6506 3rd Qu.: 0.7958 3rd Qu.: 0.7527 3rd Qu.: 0.72896 3rd Qu.: 0.73143 3rd Qu.: 0.7180 3rd Qu.: 0.7016 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.: 0.48004 3rd Qu.: 0.90812 3rd Qu.: 0.6618 3rd Qu.:6.621 3rd Qu.:2.000
Max. : 1.91949 Max. : 2.86292 Max. : 2.6539 Max. : 2.9570 Max. : 2.7264 Max. : 3.0532 Max. : 2.6588 Max. : 1.6920 Max. : 1.7349 Max. : 1.92503 Max. : 1.63284 Max. : 1.8369 Max. : 2.1348 Max. :2.000 Max. :2.0 Max. : 2.27718 Max. : 1.45696 Max. : 2.7259 Max. :7.808 Max. :2.000

Interpretation

par(mfrow = c(1,2))
hist(Hitters$Years, breaks = 20, main = "Original Years", col = "#c268b7")
hist(Hitters_scaled$Years, breaks = 20, main = "Scaled Years", col = "#76226c")

par(mfrow = c(1,1))

Interpretation

Mean and Standard Deviation of Numeric Variables Before Scaling
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors
Mean 408.66 109.71 11.43 55.56 51.49 41.09 7.08 7.46 6.13 3.56 5.42 5.28 5.05 5.4 3.76 8.83
SD 145.01 44.81 8.50 25.17 25.50 20.93 4.48 0.96 1.00 1.20 1.01 1.05 1.02 0.8 1.68 6.30
Mean and Standard Deviation of Scaled Numeric Variables
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors
Mean 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SD 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
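The chunk that produced these two tables is not echoed in the report; a minimal, self-contained sketch of the computation (using a toy two-column frame in place of the numeric columns of Hitters) might look like:

```r
# Toy stand-in for the numeric columns of Hitters; the report applies the same
# idea to both Hitters and Hitters_scaled
df <- data.frame(AtBat = c(300, 400, 500), Hits = c(80, 100, 120))

# One row of means and one row of standard deviations, rounded to two decimals
mean_sd <- round(rbind(Mean = sapply(df, mean), SD = sapply(df, sd)), 2)
mean_sd
```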

Interpretation

Checking P-Values

This code fits a multiple linear regression model with Salary as the dependent variable and all other predictors from the scaled dataset as independent variables.

It then extracts the p-values of each predictor from the model summary and displays them in ascending order. This helps us evaluate which variables are statistically significant in predicting salary, based on the conventional 0.05 threshold.

linear_model <- lm(Salary ~ ., data = Hitters_scaled)

summary_lm <- summary(linear_model)

p_values <- summary_lm$coefficients[, 4]
pval_df <- data.frame(
  Variable = names(p_values),
  P_value = round(p_values, 4)
) %>%
  arrange(P_value)

knitr::kable(pval_df, row.names = FALSE, caption = "P-values for Predictors of Salary (Full Model, No Splitting)")
P-values for Predictors of Salary (Full Model, No Splitting)
Variable P_value
(Intercept) 0.0000
Years 0.0000
Walks 0.0207
PutOuts 0.0412
AtBat 0.0527
Division 0.0771
CHits 0.1507
Hits 0.2279
Errors 0.3639
RBI 0.4259
Runs 0.5654
CHmRun 0.6670
CWalks 0.6889
CAtBat 0.7182
NewLeague 0.7468
CRBI 0.7643
CRuns 0.8234
HmRun 0.8388
League 0.9500
Assists 0.9858

Interpretation

Assessing Correlations

numeric_vars <- Hitters_scaled %>% select_if(is.numeric)

cor_with_salary <- cor(numeric_vars, use = "complete.obs")[, "Salary"]
cor_df <- tibble(
  Variable = names(cor_with_salary),
  Correlation = cor_with_salary
) %>%
  filter(Variable != "Salary") %>%
  arrange(desc(Correlation))

ggplot(cor_df, aes(x = "Salary", y = reorder(Variable, Correlation), fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.7) +
  geom_text(aes(label = round(Correlation, 2)), size = 4, color = "black") + 
  scale_fill_gradient2(
    low = "#67a9cf",
    mid = "white",
    high = "#d01c8b",
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  labs(
    title = "Heatmap of Correlation with Salary",
    x = "",
    y = "Variables"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

Interpretation

Assessing Multicollinearity

In this code chunk, we calculate and visualize the full correlation matrix for all variables in the Hitters_scaled dataset. The first step converts any factor variables into numeric form using lapply, then computes pairwise correlations using cor() with complete.obs to handle any remaining NA values.

We then define a custom color palette using three shades to visually represent the strength and direction of the correlations. Finally, we use ggcorrplot() to generate a heatmap of the matrix, showing only the upper triangle for a cleaner visual layout.

Hitters_num <- as.data.frame(lapply(Hitters_scaled, function(x) {
  if (is.factor(x)) as.numeric(x) else x
}))

# Compute correlation matrix
cor_matrix <- cor(Hitters_num, use = "complete.obs")

custom_colors <- c("#c268b7", "white", "#76226c")

ggcorrplot(cor_matrix,
           method = "square",
           type = "upper",
           lab = FALSE,
           colors = custom_colors,
           title = "Correlation Matrix of All Variables") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold", margin = ggplot2::margin(t = 5)),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8, margin = ggplot2::margin(r = 5))
  )

Interpretation

datatable(round(cor_matrix, 2),
          caption = "Interactive Correlation Matrix",
          options = list(pageLength = 10, scrollX = TRUE))

Interpretation

This code calculates the Variance Inflation Factor (VIF) for each predictor in our linear regression model using the scaled Hitters dataset.

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity with the other predictors; values above 10 are commonly flagged as problematic.

vif_model <- lm(Salary ~ ., data = Hitters_scaled)

vif_values <- vif(vif_model)

vif_values
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
##  27.957154  38.963208   9.175811  17.514468  13.280277   6.311489   6.332242 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks     League 
## 398.185151 512.857270  17.634045 112.645847  79.336048  31.066410   4.495162 
##   Division    PutOuts    Assists     Errors  NewLeague 
##   1.058111   1.426682   2.357862   2.245484   4.486147

Interpretation

Using LASSO to Select the Relevant Predictors

This code performs Lasso Regression with cross-validation on the Hitters dataset to identify which predictors best explain salary.

First, it separates the features (x) from the response (y = Salary), then performs Lasso using cv.glmnet() to find the optimal value of lambda that minimizes the mean squared error (MSE). The final model is built using that best lambda, and the non-zero coefficients (selected variables) are extracted.

A plot is generated showing how MSE changes across different values of log(lambda), highlighting the optimal and 1SE rule lambdas.

x <- as.matrix(Hitters_scaled[, !colnames(Hitters_scaled) %in% "Salary"])
y <- Hitters_scaled$Salary

# Perform Lasso with cross-validation
set.seed(123)
lasso_cv <- cv.glmnet(x, y, alpha = 1, standardize = FALSE)

# Extract optimal lambda values
best_lambda <- lasso_cv$lambda.min
lambda_1se <- lasso_cv$lambda.1se

# Fit final model using best lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda, standardize = FALSE)

coef_matrix <- coef(final_model)
coef_df <- as.data.frame(as.matrix(coef_matrix))
col_name <- colnames(coef_df)[1]

coef_df <- coef_df %>%
  rownames_to_column("Variable") %>%
  rename(Coefficient = all_of(col_name)) %>%
  filter(Variable != "(Intercept)" & Coefficient != 0)

kable(coef_df, digits = 4, caption = "Variables Selected by Lasso (Non-zero Coefficients)")
Variables Selected by Lasso (Non-zero Coefficients)
Variable Coefficient
AtBat -0.1644
Hits 0.1018
Runs -0.0239
RBI 0.0678
Walks 0.1185
Years -0.2974
CHits 0.9038
CHmRun 0.0396
Division -0.0929
PutOuts 0.0661
Errors -0.0338
NewLeague 0.0264
selected_vars <- coef_df$Variable
Hitters_lasso_final <- Hitters_scaled %>%
  dplyr::select(all_of(selected_vars), Salary)

cv_plot_data <- data.frame(
  lambda = lasso_cv$lambda,
  mse = lasso_cv$cvm,
  se = lasso_cv$cvsd
)

ggplot(cv_plot_data, aes(x = log(lambda), y = mse)) +
  geom_line(color = "#91477f", linewidth = 1) +
  geom_point(color = "#f2739f", size = 2) +
  geom_errorbar(aes(ymin = mse - se, ymax = mse + se), width = 0.15, color = "gray60") +
  geom_vline(xintercept = log(best_lambda), linetype = "dashed", color = "#2b303a", linewidth = 1) +
  geom_vline(xintercept = log(lambda_1se), linetype = "dotted", color = "#2b303a", linewidth = 1) +
  labs(
    title = "Lasso Cross-Validation Curve",
    subtitle = "Optimal Lambda (dashed) vs 1SE Rule (dotted)",
    x = "Log(λ)",
    y = "Mean Squared Error"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16, color = "#4B0082"),
    plot.subtitle = element_text(size = 12)
  )

Interpretation

Experience vs Performance Correlation

In this part, a subset of the dataset (Hitters_lasso_final) is selected, focusing on four variables: Years, CHmRun, Walks and Runs. These variables reflect experience and performance in baseball.

The cor() function calculates pairwise Pearson correlation coefficients between them using complete observations only.

Then, the matrix is reshaped into long format using melt() from the reshape2 package to prepare for visualization.

A heatmap is generated using ggplot2, where each tile represents the strength and direction of the correlation, with a custom color gradient from red (negative) to blue (positive); numeric correlation values are overlaid directly on the tiles for easier interpretation.

exp_perf_vars <- Hitters_lasso_final %>%
  dplyr::select(Years, CHmRun, Walks, Runs)

cor_matrix <- cor(exp_perf_vars, use = "complete.obs") %>%
  round(2)

cor_melted <- melt(cor_matrix)

ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = value), color = "black", size = 4.5) +
  scale_fill_gradient2(low = "#b2182b", mid = "white", high = "#2166ac", 
                       midpoint = 0, limit = c(-1, 1), name = "Corr") +
  labs(title = "Experience vs Performance: Correlation Matrix",
       x = "", y = "") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.title = element_text(face = "bold", size = 14))

Interpretation

Player Value Score

Here we calculate a custom Player Value Score as a proxy for performance efficiency using the formula:

(Runs + Walks + CHmRun) / (Years + 1)

This score reflects how productive a player is relative to their experience. A cap is applied at the 99th percentile to avoid extreme outliers skewing the plot.

Then, it selects the top 5 most efficient players based on this metric and visualizes two plots:

  1. Player Value vs Salary: A scatter plot with a LOESS smooth line showing general trends.

  2. Top 5 Most Efficient Players: Highlights the top scorers and labels them, helping compare performance to salary visually.

Hitters_lasso_final <- Hitters_lasso_final %>%
  mutate(Player_Value_Score = (Runs + Walks + CHmRun) / (Years + 1))

Hitters_lasso_final$Player_Value_Score <- pmin(
  Hitters_lasso_final$Player_Value_Score,
  quantile(Hitters_lasso_final$Player_Value_Score, 0.99)
)

top5 <- Hitters_lasso_final %>%
  top_n(5, wt = Player_Value_Score) %>%
  arrange(desc(Player_Value_Score)) %>%
  mutate(Rank_Label = paste0("Rank ", row_number()))

p1 <- ggplot(Hitters_lasso_final, aes(x = Salary, y = Player_Value_Score)) +
  geom_point(aes(color = Player_Value_Score), size = 2.5, alpha = 0.75) +
  geom_smooth(method = "loess", se = FALSE, color = "darkblue", linetype = "solid") +
  scale_color_viridis_c(option = "C", direction = -1, name = "Value Score") +
  labs(
    title = "Player Value vs. Salary",
    subtitle = "Does higher salary reflect higher performance?",
    x = "Salary (Scaled)",
    y = "Player Value Score"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    plot.margin = ggplot2::margin(t = 15, r = 20, b = 10, l = 20, unit = "pt")
  )


p2 <- ggplot(Hitters_lasso_final, aes(x = Salary, y = Player_Value_Score)) +
  geom_point(color = "gray85", size = 2, alpha = 0.5) +
  geom_point(data = top5, aes(color = Player_Value_Score), size = 5, alpha = 0.95) +
  geom_text_repel(data = top5, aes(label = Rank_Label),
                  fontface = "bold", size = 4, box.padding = 0.5,
                  color = "black", segment.color = "gray40") +
  scale_color_viridis_c(option = "D", direction = 1, name = "Value Score") +
  labs(
    title = "Top 5 Most Efficient Players",
    subtitle = "Ranked by Player Value Score vs Salary",
    x = "Salary (Scaled)",
    y = "Player Value Score"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    plot.margin = ggplot2::margin(t = 15, r = 20, b = 10, l = 20, unit = "pt")
  )

p1 / p2 + plot_layout(heights = c(1.1, 1))

Interpretation

Plot 1: Player Value vs Salary

Plot 2: Top 5 Most Efficient Players

Scatterplots to visualize how these variables relate to salaries

Here we generate interactive scatterplots using plotly to visualize how three selected performance variables (Years, AtBat, and Errors) relate to players' salaries.

The lapply() function loops over each variable, and for each one, a scatterplot is created with players’ Salary on the y-axis and the selected variable on the x-axis.

Tooltips are customized for better interactivity. Finally, subplot() is used to arrange the three plots vertically with shared Y-axes and a centralized title using layout().

features <- c("Years", "AtBat", "Errors")

styled_plots <- lapply(features, function(var) {
  plot_ly(
    data = Hitters_lasso_final,
    x = ~get(var),
    y = ~Salary,
    type = 'scatter',
    mode = 'markers',
    marker = list(color = '#914777', opacity = 0.75, size = 6),
    hovertemplate = paste0(
      "<b>", var, "</b>: %{x:.2f}<br>",
      "<b>Salary</b>: %{y:.2f}<extra></extra>"
    )
  ) %>% layout(
    xaxis = list(title = var, tickfont = list(size = 10), titlefont = list(size = 11)),
    yaxis = list(title = "Salary", tickfont = list(size = 10), titlefont = list(size = 11)),
    margin = list(l = 40, r = 10, b = 40, t = 50),  # increased top margin for the title
    showlegend = FALSE,
    plot_bgcolor = '#fafafa',
    paper_bgcolor = '#fafafa'
  )
})

subplot(styled_plots, nrows = 2, shareY = TRUE, titleX = TRUE, titleY = TRUE, margin = 0.05) %>%
  layout(
    title = list(
      text = "Interactive Scatterplots: Salary vs Performance Metrics",
      x = 0.5, font = list(size = 16, family = "Arial", color = "black")
    )
  )

Interpretation

From the interactive scatterplots:

Salary Distribution by Division

Here we visualize the salary distribution of baseball players from the Hitters dataset, split by Division (East vs West).

The Division column is first re-labeled with factor labels “East” and “West”, and custom colors are defined for each group.

The plot uses geom_density() to draw smooth distribution curves for salary. Styling is applied to make the plot visually appealing with customized legends, titles, and minimalistic theme settings.

Hitters_lasso_final$Division <- Hitters$Division
Hitters_lasso_final$Division <- factor(Hitters_lasso_final$Division, levels = c(1, 2), labels = c("East", "West"))

division_colors <- c("East" = "#914777", "West" = "#c268b7")

ggplot(Hitters_lasso_final, aes(x = Salary, fill = Division)) +
  geom_density(alpha = 0.6, color = NA) +
  scale_fill_manual(values = division_colors) +
  labs(
    title = "Salary Distribution by Division",
    subtitle = "Comparing Scaled Salary Levels between East and West Divisions",
    x = "Scaled Salary",
    y = "Density",
    fill = "Division"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.position = "top",
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
    plot.margin = unit(c(12, 20, 10, 20), "pt")  
  )

Interpretation

The resulting density plot shows a comparison of scaled salary distributions between players in the East and West divisions.

This suggests that while salary levels are generally comparable across divisions, salary dispersion is wider in the West, potentially due to more outliers or a broader mix of player experience/performance levels.

Here, we remove the Player_Value_Score variable that we created earlier, restoring the original set of predictors.

##  [1] "AtBat"     "Hits"      "Runs"      "RBI"       "Walks"     "Years"    
##  [7] "CHits"     "CHmRun"    "Division"  "PutOuts"   "Errors"    "NewLeague"
## [13] "Salary"
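The chunk that produced the name listing above is not echoed; a minimal sketch of the step, using a toy stand-in for Hitters_lasso_final so the snippet runs on its own:

```r
# Toy stand-in; in the report, Hitters_lasso_final comes from the Lasso step above
Hitters_lasso_final <- data.frame(Runs = 1:3, Salary = 4:6, Player_Value_Score = 7:9)

# Drop the engineered score to restore the original predictors
Hitters_final <- subset(Hitters_lasso_final, select = -Player_Value_Score)
names(Hitters_final)
```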

Creating the Salary Levels

Elbow Method

In this chunk, we’re using the Elbow Method to decide how many clusters we should use to segment the players’ salaries.

We first remove NA values from the Salary column and compute the total within-cluster sum of squares (WSS) for clusters from 1 to 10 using kmeans(). Then we calculate how much WSS drops when moving from one cluster to the next, both in absolute and percentage terms.

salary_vector <- na.omit(Hitters_final$Salary)

set.seed(123)

wss <- sapply(1:10, function(k) {
  kmeans(salary_vector, centers = k, nstart = 20)$tot.withinss
})

wss_diff <- c(NA, diff(wss))
wss_pct_drop <- c(NA, round(100 * diff(wss) / head(wss, -1), 2))

elbow_table <- data.frame(
  Clusters = 1:10,
  WSS = round(wss, 0),
  WSS_Drop = round(wss_diff, 0),
  Percent_Drop = wss_pct_drop
)

print(elbow_table)
##    Clusters WSS WSS_Drop Percent_Drop
## 1         1 183       NA           NA
## 2         2  51     -133       -72.42
## 3         3  26      -24       -48.29
## 4         4  15      -11       -43.35
## 5         5   9       -6       -40.41
## 6         6   6       -3       -32.03
## 7         7   5       -1       -22.77
## 8         8   4       -1       -21.82
## 9         9   3       -1       -17.75
## 10       10   2       -1       -21.06

Interpretation

Elbow Plot

Here we create an elbow plot with numerical WSS labels at each point.

It helps visualize how the total within-cluster sum of squares (WSS) changes with different numbers of clusters (K). The plot() function draws the elbow plot, and text() is used to annotate each point with its corresponding WSS value.

plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of Clusters K",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Plot with WSS Labels")
text(1:10, wss, labels = round(wss, 0), pos = 3, cex = 0.8)

Interpretation

Creating the Salary Levels

In this chunk, we create a new categorical variable called SalaryLevel by dividing the continuous Salary variable into four levels based on quartiles. This is done using the cut() function, which splits the salaries into 4 bins at the quartile boundaries.

We use include.lowest = TRUE so that the lowest salary value falls into the first bin, and pass na.rm = TRUE to quantile() to handle missing data. Then, a frequency table and a barplot are created to show how many players fall into each salary level.

Hitters_final$SalaryLevel <- cut(
  Hitters_final$Salary,
  breaks = quantile(Hitters_final$Salary, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
  labels = c("Rookie", "Starter", "Pro", "MVP"),
  include.lowest = TRUE
)

table(Hitters_final$SalaryLevel)
## 
##  Rookie Starter     Pro     MVP 
##      60      59      65      54
barplot(table(Hitters_final$SalaryLevel), col = "darkblue",
        main = "Number of Players per Salary Level",
        ylab = "Count")

Interpretation

Here we remove the Salary attribute to prevent data leakage later on.

names(Hitters_final)
##  [1] "AtBat"       "Hits"        "Runs"        "RBI"         "Walks"      
##  [6] "Years"       "CHits"       "CHmRun"      "Division"    "PutOuts"    
## [11] "Errors"      "NewLeague"   "Salary"      "SalaryLevel"
model_data <- subset(Hitters_final, select = -Salary)

names(model_data)
##  [1] "AtBat"       "Hits"        "Runs"        "RBI"         "Walks"      
##  [6] "Years"       "CHits"       "CHmRun"      "Division"    "PutOuts"    
## [11] "Errors"      "NewLeague"   "SalaryLevel"

Explore Predictors by SalaryLevel

Here we create a 2x2 grid of boxplots to visually explore how four key performance indicators — Years, Hits, RBI, and Walks — vary across the defined SalaryLevel categories (“Rookie”, “Starter”, “Pro”, “MVP”).

Each boxplot shows the distribution of a specific variable across the four salary levels using different colors to help differentiate the plots.

par(mfrow = c(2, 2))  

boxplot(model_data$Years ~ model_data$SalaryLevel, 
        main = "Years vs Salary Level", col = "lightblue", ylab = "Years")

boxplot(model_data$Hits ~ model_data$SalaryLevel, 
        main = "Hits vs Salary Level", col = "lightgreen", ylab = "Hits")

boxplot(model_data$RBI ~ model_data$SalaryLevel, 
        main = "RBI vs Salary Level", col = "lightcoral", ylab = "RBI")

boxplot(model_data$Walks ~ model_data$SalaryLevel, 
        main = "Walks vs Salary Level", col = "plum", ylab = "Walks")

par(mfrow = c(1, 1))

Interpretation

Model Data Summary

This chunk first summarizes all variables in the model_data dataframe. Then, using dplyr and tidyr, it selects only the numeric variables and reshapes them into a long format so each value can be associated with its variable name.

The final ggplot uses geom_boxplot() to visualize the distribution of all numeric variables, showing medians, interquartile ranges, and outliers across the entire dataset.

summary(model_data)
##      AtBat               Hits               Runs              RBI         
##  Min.   :-1.94928   Min.   :-1.84561   Min.   :-1.8892   Min.   :-1.7057  
##  1st Qu.:-0.86658   1st Qu.:-0.81352   1st Qu.:-0.8564   1st Qu.:-0.8330  
##  Median : 0.05062   Median :-0.03807   Median :-0.1016   Median :-0.1956  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.82817   3rd Qu.: 0.75413   3rd Qu.: 0.7623   3rd Qu.: 0.7947  
##  Max.   : 1.91949   Max.   : 2.86292   Max.   : 2.9570   Max.   : 2.7264  
##      Walks             Years             CHits             CHmRun        
##  Min.   :-1.8198   Min.   :-1.3575   Min.   :-2.3277   Min.   :-2.39113  
##  1st Qu.:-0.8166   1st Qu.:-0.6881   1st Qu.:-0.7236   1st Qu.:-0.65527  
##  Median :-0.1955   Median :-0.2419   Median : 0.1054   Median : 0.09905  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.7480   3rd Qu.: 0.6506   3rd Qu.: 0.7527   3rd Qu.: 0.72896  
##  Max.   : 3.0532   Max.   : 2.6588   Max.   : 1.7349   Max.   : 1.92503  
##  Division      PutOuts             Errors          NewLeague      SalaryLevel
##  East:119   Min.   :-2.54420   Min.   :-1.4023   Min.   :1.000   Rookie :60  
##  West:119   1st Qu.:-0.63300   1st Qu.:-0.7672   1st Qu.:1.000   Starter:59  
##             Median : 0.03888   Median :-0.2909   Median :1.000   Pro    :65  
##             Mean   : 0.00000   Mean   : 0.0000   Mean   :1.471   MVP    :54  
##             3rd Qu.: 0.48004   3rd Qu.: 0.6618   3rd Qu.:2.000               
##             Max.   : 2.27718   Max.   : 2.7259   Max.   :2.000
library(ggplot2)
library(dplyr)
library(tidyr)

numeric_vars <- model_data %>% 
  select_if(is.numeric) %>% 
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(numeric_vars, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Boxplots of Numeric Variables in model_data",
       x = "Variable", y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation

Summary of p-values and z-scores

In this chunk, we’re running a multinomial logistic regression using the multinom() function from the nnet package to model the relationship between several predictors and the SalaryLevel outcome variable (with categories: Rookie, Starter, Pro, MVP).

We compare each level against the baseline (“Rookie”) and extract coefficients, standard errors, z-values, and p-values for interpretation. Then, we combine all of this into a clean table using kable().

library(nnet)
library(knitr)
library(dplyr)

full_model <- multinom(SalaryLevel ~ ., data = model_data, trace = FALSE)

summary_full <- summary(full_model)

coefs <- summary_full$coefficients
std_errs <- summary_full$standard.errors

z_vals <- coefs / std_errs
p_vals <- 2 * (1 - pnorm(abs(z_vals)))

levels <- rownames(coefs)

coef_list <- lapply(seq_along(levels), function(i) {
  data.frame(
    Predictor = colnames(coefs),
    Estimate = round(coefs[i, ], 4),
    StdError = round(std_errs[i, ], 4),
    Z_value = round(z_vals[i, ], 3),
    P_value = round(p_vals[i, ], 4),
    Comparison = levels[i],
    row.names = NULL  # avoid duplicated row-name labels in the combined table
  )
})

final_table <- bind_rows(coef_list)

kable(final_table, caption = "Multinomial Logistic Regression Results by Salary Level (vs. Rookie)")
Multinomial Logistic Regression Results by Salary Level (vs. Rookie)
Predictor Estimate StdError Z_value P_value Comparison
(Intercept) 3.3682 1.2859 2.619 0.0088 Starter
AtBat -2.9380 1.6222 -1.811 0.0701 Starter
Hits 0.5993 1.3777 0.435 0.6636 Starter
Runs 0.7695 1.0247 0.751 0.4527 Starter
RBI 0.8891 0.6926 1.284 0.1993 Starter
Walks 0.9943 0.5920 1.679 0.0931 Starter
Years -0.0375 1.1667 -0.032 0.9744 Starter
CHits 4.6897 1.2830 3.655 0.0003 Starter
CHmRun -0.4984 0.6871 -0.725 0.4682 Starter
DivisionWest -0.4336 0.6200 -0.699 0.4844 Starter
PutOuts -0.1894 0.3724 -0.509 0.6110 Starter
Errors -0.4851 0.3974 -1.221 0.2221 Starter
NewLeague 0.0254 0.6489 0.039 0.9688 Starter
(Intercept) 2.5340 1.4356 1.765 0.0775 Pro
AtBat -4.1186 1.8447 -2.233 0.0256 Pro
Hits 2.0939 1.5976 1.311 0.1900 Pro
Runs -0.0571 1.1567 -0.049 0.9606 Pro
RBI 0.8455 0.8637 0.979 0.3276 Pro
Walks 2.1102 0.6948 3.037 0.0024 Pro
Years -1.6495 1.2962 -1.273 0.2032 Pro
CHits 8.5713 1.6280 5.265 0.0000 Pro
CHmRun -0.2560 0.8501 -0.301 0.7633 Pro
DivisionWest -0.5269 0.7358 -0.716 0.4740 Pro
PutOuts 0.2207 0.4342 0.508 0.6112 Pro
Errors -0.1883 0.4589 -0.410 0.6815 Pro
NewLeague 0.4121 0.7578 0.544 0.5866 Pro
(Intercept) 0.0246 1.7120 0.014 0.9885 MVP
AtBat -3.1746 2.0124 -1.578 0.1147 MVP
Hits 1.0602 1.8492 0.573 0.5664 MVP
Runs -0.1673 1.2860 -0.130 0.8965 MVP
RBI 1.3944 0.9999 1.395 0.1631 MVP
Walks 2.3725 0.7396 3.208 0.0013 MVP
Years -3.5471 1.4875 -2.385 0.0171 MVP
CHits 13.2905 2.1352 6.224 0.0000 MVP
CHmRun -0.0104 1.0599 -0.010 0.9922 MVP
DivisionWest -0.8901 0.8496 -1.048 0.2948 MVP
PutOuts 0.6009 0.5010 1.199 0.2304 MVP
Errors -0.3352 0.5033 -0.666 0.5054 MVP
NewLeague 0.0775 0.8766 0.088 0.9295 MVP

Interpretation

Cross-Validation

Cross-validation is a resampling method used to evaluate the performance of a model on unseen data.

Instead of splitting the dataset just once into training and testing sets, cross-validation divides the data into multiple folds, trains the model on some folds, and tests it on the remaining ones.

This process is repeated several times, and the performance metrics are averaged to provide a more robust estimate of model accuracy.

In our project, we implemented cross-validation using the train() function from the caret package. This allowed us to fairly compare models by evaluating their predictive power on different subsets of the data, ensuring that our results weren’t biased by a specific train-test split.

We applied this method consistently across all models to maintain a reliable and consistent evaluation framework.
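
The fold-splitting logic that caret automates can be sketched by hand. The snippet below is a minimal illustration of 10-fold CV on synthetic data; the data frame df, its outcome y, and the binary glm() stand-in are assumptions for illustration only, not part of our pipeline:

```r
# Minimal sketch of 10-fold cross-validation "by hand" (synthetic data)
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(200) > 0, "High", "Low"))

k <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold assignment

fold_acc <- sapply(1:k, function(i) {
  train_i <- df[folds != i, ]                     # train on 9 folds
  test_i  <- df[folds == i, ]                     # test on the held-out fold
  fit  <- glm(y ~ x1 + x2, data = train_i, family = binomial)
  # glm() models the probability of the second factor level ("Low")
  pred <- ifelse(predict(fit, test_i, type = "response") > 0.5, "Low", "High")
  mean(pred == test_i$y)                          # held-out accuracy
})

mean(fold_acc)  # averaged over folds: the cross-validated accuracy estimate
```

Averaging over the held-out folds is exactly what produces the Accuracy column in the caret resampling summaries later on.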

Choosing the CV Method

This chunk defines a method for selecting a cross-validation (CV) approach randomly. First, we set a seed (6102004) to ensure reproducibility of the random sampling. Then, we define a list of possible validation methods:

  1. Vanilla validation set
  2. Leave-One-Out Cross-Validation (LOO-CV)
  3. 5-fold CV
  4. 10-fold CV

Finally, the sample() function randomly picks one method from the list and assigns it to chosen_method, which is then printed.

set.seed(6102004)  # Nikol Tushaj's bday

cv_methods <- c(
  "1. Vanilla validation set",
  "2. Leave-One-Out CV (LOO-CV)",
  "3. K-fold CV (K = 5)",
  "4. K-fold CV (K = 10)"
)

chosen_method <- sample(cv_methods, 1)
chosen_method
## [1] "2. Leave-One-Out CV (LOO-CV)"

Interpretation

Training

In this chunk, we use the train() function from the caret package to train a multinomial logistic regression model using Leave-One-Out Cross-Validation (LOOCV).

We first set up a control object ctrl_loo specifying "LOOCV" as the method, then pass it to the train() function along with the model formula and data.

This means the model is trained 238 times (once for each observation), each time leaving one observation out for testing and using the rest for training.

library(caret)
library(nnet)

set.seed(6102004)  

ctrl_loo <- trainControl(method = "LOOCV")

loo_model <- train(
  SalaryLevel ~ .,
  data = model_data,
  method = "multinom",
  trControl = ctrl_loo,
  trace = FALSE
)
print(loo_model)
## Penalized Multinomial Regression 
## 
## 238 samples
##  12 predictor
##   4 classes: 'Rookie', 'Starter', 'Pro', 'MVP' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 237, 237, 237, 237, 237, 237, ... 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6428571  0.5229674
##   1e-04  0.6428571  0.5229674
##   1e-01  0.6596639  0.5456302
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.

Interpretation

This code uses the multinomial model selected by Leave-One-Out Cross-Validation (LOOCV) to predict the salary level for each player in the full dataset. It then compares the predicted values to the actual values using a confusion matrix (confusionMatrix() from the caret package), which summarizes the model’s classification performance. Note that these are in-sample (resubstitution) predictions, so the resulting accuracy is more optimistic than the LOOCV estimate reported above.

library(caret)

preds <- predict(loo_model, newdata = model_data)

conf_mat <- confusionMatrix(preds, model_data$SalaryLevel)
acc <- round(conf_mat$overall["Accuracy"], 4)
kappa <- round(conf_mat$overall["Kappa"], 4)

cat("**Model Accuracy:**", acc, "\n")
## **Model Accuracy:** 0.7521
cat("**Cohen's Kappa:**", kappa, "\n")
## **Cohen's Kappa:** 0.6692

Interpretation

library(knitr)
library(kableExtra)

conf_df <- as.data.frame(conf_mat$table)

conf_wide <- tidyr::pivot_wider(
  conf_df,
  names_from = Reference,
  values_from = Freq
)

kable(conf_wide, caption = "Confusion Matrix: Predicted vs Actual Salary Levels") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  )
Confusion Matrix: Predicted vs Actual Salary Levels
Prediction Rookie Starter Pro MVP
Rookie 54 7 0 0
Starter 3 41 13 1
Pro 3 11 40 9
MVP 0 0 12 44

Interpretation

Analysing metrics

In this chunk, we calculated class-specific performance metrics from the confusion matrix using conf_mat$byClass.

After converting the results into a dataframe and extracting the class labels, we calculated Balanced Accuracy for each class manually using the average of Sensitivity and Specificity.

We then selected the most relevant columns (Class, Sensitivity, Specificity, Balanced Accuracy) and rounded the numeric values to three decimals.

library(dplyr)

by_class <- as.data.frame(conf_mat$byClass)
by_class$Class <- rownames(conf_mat$byClass)
rownames(by_class) <- NULL

by_class$Balanced.Accuracy <- round((by_class$Sensitivity + by_class$Specificity) / 2, 3)

class_metrics <- by_class %>%
  dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3)))

kable(class_metrics, caption = "Class-wise Performance Metrics") %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed")
  )
Class-wise Performance Metrics
Class Sensitivity Specificity Balanced.Accuracy
Class: Rookie 0.900 0.961 0.930
Class: Starter 0.695 0.905 0.800
Class: Pro 0.615 0.867 0.741
Class: MVP 0.815 0.935 0.875
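
These class-wise figures can be recomputed from the confusion-matrix counts in a one-vs-rest fashion. The sketch below reproduces the Rookie row of the table above:

```r
# One-vs-rest metrics for the Rookie class, from the LOOCV confusion matrix
cm <- matrix(c(54,  7,  0,  0,
                3, 41, 13,  1,
                3, 11, 40,  9,
                0,  0, 12, 44),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = c("Rookie", "Starter", "Pro", "MVP"),
                             Reference  = c("Rookie", "Starter", "Pro", "MVP")))

i  <- 1                          # Rookie
tp <- cm[i, i]                   # actual Rookies predicted as Rookie
fn <- sum(cm[, i]) - tp          # actual Rookies predicted as something else
fp <- sum(cm[i, ]) - tp          # non-Rookies predicted as Rookie
tn <- sum(cm) - tp - fn - fp     # non-Rookies predicted as non-Rookie

sens <- tp / (tp + fn)           # 54 / 60
spec <- tn / (tn + fp)           # 171 / 178

round(c(Sensitivity = sens, Specificity = spec,
        Balanced.Accuracy = (sens + spec) / 2), 3)  # 0.900, 0.961, 0.930
```

The same decomposition applied to the other three classes reproduces the remaining rows.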

Interpretation

model_data_full <- model_data  # keep a copy of the full predictor set for model selection

Stepwise model selection using AIC

In this code, we applied stepwise model selection using AIC (Akaike Information Criterion) to find a more parsimonious version of our multinomial logistic regression model.

We started with a full model (full_model_step) that included all predictors in model_data_full, and then used stepAIC() with direction = "both" to allow both forward selection and backward elimination.

This means variables could be added or removed iteratively based on which changes improved the AIC the most.

library(MASS)
library(nnet)

full_model_step <- multinom(SalaryLevel ~ ., data = model_data_full, trace = FALSE)

stepwise_model <- stepAIC(full_model_step, direction = "both", trace = FALSE)

summary(stepwise_model)
## Call:
## multinom(formula = SalaryLevel ~ AtBat + RBI + Walks + Years + 
##     CHits + PutOuts, data = model_data_full, trace = FALSE)
## 
## Coefficients:
##         (Intercept)     AtBat       RBI    Walks      Years     CHits
## Starter   3.1465374 -1.674272 0.5392789 1.072940  0.2803077  4.075458
## Pro       2.7658801 -2.025021 0.6854214 1.967439 -1.2810930  8.222139
## MVP      -0.3324148 -2.351867 1.4056373 2.259785 -2.9102031 12.765125
##            PutOuts
## Starter -0.2032505
## Pro      0.1624575
## MVP      0.6621590
## 
## Std. Errors:
##         (Intercept)     AtBat       RBI     Walks    Years    CHits   PutOuts
## Starter   0.7397138 0.6499693 0.4282029 0.4738105 1.072957 1.032214 0.3500595
## Pro       0.7531370 0.7439332 0.5169125 0.5656089 1.204222 1.372753 0.4117279
## MVP       0.9928219 0.8427625 0.6098148 0.6118005 1.360912 1.814912 0.4757293
## 
## Residual Deviance: 317.2372 
## AIC: 359.2372
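
A quick sanity check on the reported AIC: for a multinom() fit, AIC equals the residual deviance plus twice the number of estimated coefficients. The reduced model keeps 6 predictors plus an intercept, and each of the 3 non-baseline levels (Starter, Pro, MVP) has its own coefficient vector, giving 21 parameters:

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
deviance <- 317.2372          # from the summary above
n_coef   <- 3 * (6 + 1)       # 3 non-baseline levels x (6 predictors + intercept)
aic      <- deviance + 2 * n_coef
aic                           # 359.2372, matching the summary output
```

This is the quantity stepAIC() minimizes: dropping a predictor removes 3 coefficients (one per non-baseline level), so it is kept only if it reduces the deviance by more than 6.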

Interpretation

Evaluating the reduced model

In this section, we’re evaluating the performance of our reduced multinomial logistic regression model—a simpler model derived through stepwise AIC selection. We’re applying Leave-One-Out Cross-Validation (LOOCV) to this reduced model to ensure a fair and consistent comparison with the full model.

library(caret)
library(nnet)

set.seed(6102004)

ctrl_loo <- trainControl(method = "LOOCV")

reduced_model <- train(
  formula(stepwise_model),
  data = model_data_full,
  method = "multinom",
  trControl = ctrl_loo,
  trace = FALSE
)
print(reduced_model)
## Penalized Multinomial Regression 
## 
## 238 samples
##   6 predictor
##   4 classes: 'Rookie', 'Starter', 'Pro', 'MVP' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 237, 237, 237, 237, 237, 237, ... 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6680672  0.5564415
##   1e-04  0.6680672  0.5564415
##   1e-01  0.6512605  0.5338839
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.

Interpretation

In this block, we’re evaluating the prediction performance of the stepwise-selected multinomial model.

We generate in-sample predictions on the full dataset using the reduced model (mirroring how the full model was evaluated above), compute the confusion matrix comparing predicted vs actual salary levels, and format it with kableExtra for display.

The goal is to assess how well the reduced model classifies players across the four salary categories (Rookie, Starter, Pro, MVP).

library(caret)
library(knitr)
library(kableExtra)

stepwise_preds <- predict(stepwise_model, newdata = model_data_full)

conf_mat_stepwise <- confusionMatrix(stepwise_preds, model_data_full$SalaryLevel)

conf_df <- as.data.frame(conf_mat_stepwise$table)

conf_wide <- tidyr::pivot_wider(conf_df,
                                names_from = Reference,
                                values_from = Freq)

kable(conf_wide, caption = "Confusion Matrix: Stepwise Model (Predicted vs Actual Salary Levels)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                position = "center")
Confusion Matrix: Stepwise Model (Predicted vs Actual Salary Levels)
Prediction Rookie Starter Pro MVP
Rookie 54 6 0 0
Starter 3 41 13 0
Pro 2 11 38 14
MVP 1 1 14 40

Interpretation

Comparing the reduced model and the full one

In this section, we compared the performance of the full and reduced multinomial logistic regression models by extracting their respective accuracy and Cohen’s Kappa scores from their confusion matrices.

These values were then stored in a data frame and displayed in a clean table format for easy comparison.

full_acc <- round(conf_mat$overall["Accuracy"], 4)
full_kappa <- round(conf_mat$overall["Kappa"], 4)

reduced_acc <- round(conf_mat_stepwise$overall["Accuracy"], 4)
reduced_kappa <- round(conf_mat_stepwise$overall["Kappa"], 4)

acc_kappa_df <- data.frame(
  Model = c("Full Model", "Reduced Model"),
  Accuracy = c(full_acc, reduced_acc),
  Kappa = c(full_kappa, reduced_kappa)
)

library(knitr)
library(kableExtra)

kable(acc_kappa_df, caption = "Accuracy & Kappa: Full vs Reduced Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                position = "center")
Accuracy & Kappa: Full vs Reduced Model
Model Accuracy Kappa
Full Model 0.7521 0.6692
Reduced Model 0.7269 0.6354

Interpretation

In this step, we compared the confusion matrices of the full and reduced models side by side using a heatmap visualization. The matrices were converted into data frames and labeled accordingly, then merged into one dataset for plotting.

library(ggplot2)

conf_full_df <- as.data.frame(conf_mat$table)
conf_reduced_df <- as.data.frame(conf_mat_stepwise$table)
conf_full_df$Model <- "Full"
conf_reduced_df$Model <- "Reduced"

conf_all <- rbind(conf_full_df, conf_reduced_df)

ggplot(conf_all, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white", width = 0.9, height = 0.9) +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  scale_fill_gradient(low = "#ffb3d9", high = "#660066") +
  facet_wrap(~Model, nrow = 1) +  
  labs(
    title = "Confusion Matrices: Full vs Reduced Model",
    x = "Actual", y = "Predicted"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    strip.text = element_text(face = "bold", size = 14),
    plot.title = element_text(hjust = 0.5, size = 16),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

Interpretation

Vanilla validation method

The vanilla validation method refers to a simple train-test split, where the dataset is divided into two parts: one for training the model and the other for testing its performance. In our case, we used the caTools package to randomly split the data so that 80% is used for training (train_data) and the remaining 20% for testing (test_data). The split is stratified based on the target variable SalaryLevel to preserve class proportions across the sets.

We chose this approach because it’s a straightforward and computationally light way to evaluate model performance, especially when dealing with reasonably sized datasets. This method allows us to simulate real-world prediction tasks by training on one portion of the data and testing on unseen data to assess generalization accuracy.

library(caTools)

set.seed(6102004)

split <- sample.split(model_data_full$SalaryLevel, SplitRatio = 0.8)

train_data <- subset(model_data_full, split == TRUE)
test_data <- subset(model_data_full, split == FALSE)

Training on the training set with the vanilla validation method

We are continuing our vanilla modeling process, where we apply a multinomial logistic regression to the training data (which was previously split using the vanilla 80/20 split). This code block trains the model on the train_data and displays the model summary, which includes coefficient estimates, standard errors, residual deviance, and AIC.

library(nnet)

vanilla_model <- multinom(SalaryLevel ~ ., data = train_data, trace = FALSE)
summary(vanilla_model)
## Call:
## multinom(formula = SalaryLevel ~ ., data = train_data, trace = FALSE)
## 
## Coefficients:
##         (Intercept)     AtBat      Hits      Runs      RBI     Walks      Years
## Starter   3.1899758 -3.627942 1.0420699 0.6584629 1.406420 0.5519796  0.1594883
## Pro       2.4185526 -4.905749 2.4735754 0.4483905 1.408011 1.4665724 -1.5880947
## MVP       0.3240155 -3.223980 0.3038925 0.7428968 2.277004 1.7460073 -3.2610394
##             CHits     CHmRun DivisionWest    PutOuts     Errors NewLeague
## Starter  5.487762 -0.6177391   -0.8706095 -0.1544943 -0.4234180 0.6556809
## Pro      9.686622 -0.5235629   -0.9806671  0.1341253 -0.2609225 0.9321448
## MVP     14.725130 -0.5496685   -1.5136167  0.5480782 -0.4209294 0.2098092
## 
## Std. Errors:
##         (Intercept)    AtBat     Hits     Runs       RBI     Walks    Years
## Starter    1.721113 2.131758 1.768098 1.270791 0.9716359 0.8030795 1.511899
## Pro        1.867406 2.357700 1.958366 1.414882 1.1346763 0.8999374 1.646195
## MVP        2.138189 2.561209 2.309380 1.579893 1.2669626 0.9579582 1.854481
##            CHits    CHmRun DivisionWest   PutOuts    Errors NewLeague
## Starter 1.662908 0.8347009    0.8450726 0.4961481 0.5524199 0.9446413
## Pro     2.055991 1.0169626    0.9564956 0.5656962 0.6209299 1.0488414
## MVP     2.630358 1.2272514    1.0857020 0.6358958 0.6665705 1.1740851
## 
## Residual Deviance: 224.9579 
## AIC: 302.9579

Interpretation

Testing with the vanilla validation method

In this chunk, we’re testing the vanilla multinomial regression model on the test data to evaluate its performance. The code uses the trained vanilla_model to generate predictions for the test set (vanilla_preds) and then compares these predictions with the actual salary levels using a confusion matrix (vanilla_conf_mat).

This helps us assess how well the model generalizes to unseen data, which is essential for understanding its real-world predictive ability.

library(caret)

vanilla_preds <- predict(vanilla_model, newdata = test_data)

vanilla_conf_mat <- confusionMatrix(vanilla_preds, test_data$SalaryLevel)

This code extracts and prints the accuracy and Cohen’s Kappa score from the vanilla model’s confusion matrix. It rounds both metrics to four decimal places for readability, then outputs them using the cat() function. These two metrics are crucial for evaluating the model’s overall classification performance and agreement beyond chance.

acc_vanilla <- round(vanilla_conf_mat$overall["Accuracy"], 4)
kappa_vanilla <- round(vanilla_conf_mat$overall["Kappa"], 4)

cat("**Vanilla Accuracy:**", acc_vanilla, "\n")
## **Vanilla Accuracy:** 0.6458
cat("**Vanilla Cohen’s Kappa:**", kappa_vanilla, "\n")
## **Vanilla Cohen’s Kappa:** 0.5275

Interpretation

In this chunk, we’re visualizing the performance of our vanilla (simple train/test split) model using a confusion matrix.

library(knitr)
library(kableExtra)

vanilla_table <- as.data.frame(vanilla_conf_mat$table)
vanilla_table_wide <- tidyr::pivot_wider(vanilla_table, names_from = Reference, values_from = Freq)

kable(vanilla_table_wide, caption = "Vanilla Validation: Confusion Matrix") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Vanilla Validation: Confusion Matrix
Prediction Rookie Starter Pro MVP
Rookie 10 3 0 0
Starter 1 7 4 0
Pro 1 2 6 3
MVP 0 0 3 8

Interpretation

Comparing between the metrics

In this part, we are creating a final visual comparison of the model performance metrics—Accuracy and Cohen’s Kappa—across three different models: Full LOO-CV, Reduced LOO-CV, and Vanilla validation. We first build a data frame that stores the metric values for each model. Then, we reshape it into long format using pivot_longer() so it’s easier to plot with ggplot2.

The bar chart is constructed to display both Accuracy and Kappa side by side for each model. This allows us to directly visualize how model performance varies depending on the validation strategy and model complexity.

compare_df <- data.frame(
  Model = c("Full LOO-CV", "Reduced LOO-CV", "Vanilla"),
  Accuracy = c(0.6597, 0.6681, 0.6458),
  Kappa = c(0.5456, 0.5564, 0.5275)
)

library(ggplot2)

library(tidyr)
compare_long <- pivot_longer(compare_df, cols = c("Accuracy", "Kappa"))

ggplot(compare_long, aes(x = Model, y = value, fill = name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Comparison: Accuracy and Kappa",
       y = "Metric Value", x = "", fill = "Metric") +
  theme_minimal()

Interpretation

library(caret)

reduced_preds <- predict(stepwise_model, newdata = model_data_full)

conf_mat_reduced <- confusionMatrix(reduced_preds, model_data_full$SalaryLevel)

In this section, we combine the confusion matrices from all three models—Full, Reduced (stepwise), and Vanilla—into a single dataset for visual comparison. Each matrix is first transformed into a data frame, and a new column called Model is added to distinguish them.

library(ggplot2)
library(dplyr)
library(tidyr)

full_cm_df <- as.data.frame(conf_mat$table) %>% mutate(Model = "Full")
reduced_cm_df <- as.data.frame(conf_mat_reduced$table) %>% mutate(Model = "Reduced")
vanilla_cm_df <- as.data.frame(vanilla_conf_mat$table) %>% mutate(Model = "Vanilla")

combined_cm <- bind_rows(full_cm_df, reduced_cm_df, vanilla_cm_df)

ggplot(combined_cm, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black", size = 3) +
  facet_wrap(~Model, ncol = 3) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrices: Full vs Reduced vs Vanilla", fill = "Freq") +
  theme_minimal()

Interpretation

Random Forest

We are choosing the Random Forest model because it offers a strong balance between predictive power and interpretability, especially in complex multiclass classification problems like ours. While logistic regression models (full, reduced, and vanilla) give us useful baseline insights and allow for variable significance testing, they rely on linear boundaries and can underperform when relationships between predictors and the target are non-linear or involve interactions. In contrast, Random Forest can capture these non-linearities and interactions naturally, without requiring extensive feature engineering.

Moreover, Random Forest is an ensemble method that builds multiple decision trees and averages their results, reducing variance and improving generalization—making it more robust to overfitting than single-tree models. It also provides useful variable importance metrics, helping us understand which features contribute most to predictions. Given our goal of accurately classifying players into salary levels based on a variety of performance metrics, Random Forest gives us a powerful, flexible, and relatively interpretable tool to boost classification performance beyond what standard regression models can offer.

Training with Random Forest

In this section, we trained a Random Forest classifier on the 80% training split (train_data) to predict the categorical SalaryLevel variable.

The model was trained with 500 trees and 3 predictors tried at each split (a default setting for classification tasks).

library(randomForest)

rf_model <- randomForest(SalaryLevel ~ ., data = train_data, importance = TRUE)

print(rf_model)
## 
## Call:
##  randomForest(formula = SalaryLevel ~ ., data = train_data, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 34.21%
## Confusion matrix:
##         Rookie Starter Pro MVP class.error
## Rookie      41       4   3   0   0.1458333
## Starter      5      31  10   1   0.3404255
## Pro          0      15  26  11   0.5000000
## MVP          0       2  14  27   0.3720930
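
The “3 variables tried at each split” in the output above is not something we set explicitly: for classification, randomForest() defaults to mtry = floor(sqrt(p)), where p is the number of predictors:

```r
# randomForest's classification default for mtry is floor(sqrt(p));
# with our 12 predictors this gives the 3 reported in the model summary.
p <- 12
default_mtry <- floor(sqrt(p))
default_mtry  # 3
```

The hyperparameter tuning section below searches over alternative mtry values instead of accepting this default.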

Interpretation

rf_preds <- predict(rf_model, newdata = test_data)


library(caret)
rf_conf_mat <- confusionMatrix(rf_preds, test_data$SalaryLevel)

This part extracts the accuracy and Cohen’s Kappa score from the confusion matrix of the trained Random Forest model and prints them in a readable format.

acc_rf <- round(rf_conf_mat$overall["Accuracy"], 4)
kappa_rf <- round(rf_conf_mat$overall["Kappa"], 4)

cat("**Random Forest Accuracy:**", acc_rf, "\n")
## **Random Forest Accuracy:** 0.6875
cat("**Random Forest Cohen’s Kappa:**", kappa_rf, "\n")
## **Random Forest Cohen’s Kappa:** 0.5831

Interpretation

library(knitr)
library(kableExtra)

rf_table_not_tuned <- as.data.frame(rf_conf_mat$table)
rf_table_wide <- tidyr::pivot_wider(rf_table_not_tuned, names_from = Reference, values_from = Freq)

kable(rf_table_wide, caption = "Random Forest: Confusion Matrix") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Random Forest: Confusion Matrix
Prediction Rookie Starter Pro MVP
Rookie 11 4 0 0
Starter 0 7 3 0
Pro 1 1 7 3
MVP 0 0 3 8

Interpretation

library(dplyr)
library(knitr)
library(kableExtra)

rf_by_class_not_tuned <- as.data.frame(rf_conf_mat$byClass)
rf_by_class_not_tuned$Class <- rownames(rf_conf_mat$byClass)
rownames(rf_by_class_not_tuned) <- NULL

rf_by_class_not_tuned$Class <- gsub("Class: ", "", rf_by_class_not_tuned$Class)

rf_by_class_not_tuned$Balanced.Accuracy <- round((rf_by_class_not_tuned$Sensitivity + rf_by_class_not_tuned$Specificity) / 2, 3)

rf_class_metrics_not_tuned <- rf_by_class_not_tuned %>%
  dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3)))

kable(rf_class_metrics_not_tuned, caption = "Random Forest: Class-wise Performance Metrics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Random Forest: Class-wise Performance Metrics
Class Sensitivity Specificity Balanced.Accuracy
Rookie 0.917 0.889 0.903
Starter 0.583 0.917 0.750
Pro 0.538 0.857 0.698
MVP 0.727 0.919 0.823

Interpretation

Hyperparameter Tuning on RF

We are performing hyperparameter tuning on the Random Forest model to improve its predictive performance and generalization.

By using 10-fold cross-validation (cv) with tuneLength = 10, we let the model try different values of mtry (number of variables considered at each split) and choose the one that results in the best accuracy.

This helps reduce overfitting and ensures that the model performs well not just on the training data but also on unseen data. The code accomplishes this using the caret package’s train() function, specifying the tuning process through trainControl() and enabling feature importance output.

library(caret)

set.seed(123)
rf_tuned <- train(SalaryLevel ~ ., 
                  data = train_data,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneLength = 10,
                  importance = TRUE)
rf_tuned$bestTune
##   mtry
## 1    2
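Because `importance = TRUE` was passed to `train()`, the fitted object also stores variable-importance scores. A quick way to inspect them (a sketch using caret’s `varImp()` on the `rf_tuned` object above) is:

```r
library(caret)

# varImp() extracts the importance scores stored by the underlying
# randomForest fit; plot() shows the most influential predictors.
rf_imp <- varImp(rf_tuned)
plot(rf_imp, top = 10)
```

This gives a first look at which performance statistics drive the salary-class predictions, complementing the accuracy metrics below.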

Interpretation

plot(rf_tuned)

Interpretation

In this code, we are evaluating the performance of the tuned Random Forest model on the test data. First, predictions (rf_tuned_preds) are generated using the predict() function. Then, we assess how well the model performed by comparing these predictions to the actual salary levels in the test set using a confusion matrix.

rf_tuned_preds <- predict(rf_tuned, newdata = test_data)
conf_mat_rf_tuned <- confusionMatrix(rf_tuned_preds, test_data$SalaryLevel, mode = "everything")

Comparing the Tuned and Untuned Versions of Random Forest

In this code, we are comparing the class-wise performance metrics of the untuned and tuned Random Forest models.

We extract sensitivity, specificity, and balanced accuracy for each salary class from the respective confusion matrices.

These values are cleaned and labeled to reflect the model they come from. The shared columns between both models are then selected and the results are combined into a single dataframe.

Finally, the metrics are neatly displayed using a formatted table to assess how tuning the Random Forest impacted performance at the class level.

rf_class_metrics <- as.data.frame(rf_conf_mat$byClass)
rf_class_metrics$Class <- gsub("Class: ", "", rownames(rf_class_metrics))
rownames(rf_class_metrics) <- NULL  # drop row names so they don't leak into the table
rf_class_metrics$Model <- "RF Untuned"

rf_tuned_class_metrics <- as.data.frame(conf_mat_rf_tuned$byClass)
rf_tuned_class_metrics$Class <- gsub("Class: ", "", rownames(rf_tuned_class_metrics))
rownames(rf_tuned_class_metrics) <- NULL
rf_tuned_class_metrics$Model <- "RF Tuned"

# caret's byClass output already names this column "Balanced Accuracy",
# so no renaming is needed before selecting the shared columns.

cols_to_use <- intersect(
  c("Model", "Class", "Sensitivity", "Specificity", "Balanced Accuracy"),
  intersect(colnames(rf_class_metrics), colnames(rf_tuned_class_metrics))
)

rf_class_metrics <- rf_class_metrics[, cols_to_use]
rf_tuned_class_metrics <- rf_tuned_class_metrics[, cols_to_use]
rf_comparison <- rbind(rf_class_metrics, rf_tuned_class_metrics)

kable(rf_comparison, caption = "Class-wise Performance: RF Untuned vs RF Tuned") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Class-wise Performance: RF Untuned vs RF Tuned
Model Class Sensitivity Specificity Balanced Accuracy
RF Untuned Rookie 0.9166667 0.8888889 0.9027778
RF Untuned Starter 0.5833333 0.9166667 0.7500000
RF Untuned Pro 0.5384615 0.8571429 0.6978022
RF Untuned MVP 0.7272727 0.9189189 0.8230958
RF Tuned Rookie 0.9166667 0.8888889 0.9027778
RF Tuned Starter 0.5833333 0.9444444 0.7638889
RF Tuned Pro 0.6923077 0.8857143 0.7890110
RF Tuned MVP 0.8181818 0.9459459 0.8820639
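To quantify what tuning bought us, a quick sketch subtracts the untuned balanced accuracies from the tuned ones (values copied from the comparison table above):

```r
# Balanced accuracy per class, copied from the comparison table above.
untuned <- c(Rookie = 0.903, Starter = 0.750, Pro = 0.698, MVP = 0.823)
tuned   <- c(Rookie = 0.903, Starter = 0.764, Pro = 0.789, MVP = 0.882)

# Gain from tuning: zero for Rookie, largest for Pro.
round(tuned - untuned, 3)
##  Rookie Starter     Pro     MVP
##   0.000   0.014   0.091   0.059
```

So tuning leaves the Rookie class unchanged and delivers its biggest improvements on the Pro and MVP classes.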

Interpretation

In this code chunk, we are displaying the confusion matrix for the tuned Random Forest model using kable.

This format allows us to easily see where the model made correct or incorrect predictions. The styled output table is then rendered using the kableExtra package for better visual presentation.

library(knitr)
library(kableExtra)

rf_table <- as.data.frame(conf_mat_rf_tuned$table)
rf_table_wide <- tidyr::pivot_wider(rf_table, names_from = Reference, values_from = Freq)

kable(rf_table_wide, caption = "Random Forest Tuned: Confusion Matrix") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Random Forest Tuned: Confusion Matrix
Prediction Rookie Starter Pro MVP
Rookie 11 4 0 0
Starter 0 7 2 0
Pro 1 1 9 2
MVP 0 0 2 9

Interpretation

In this section, we aim to extract and summarize the class-wise performance metrics from the tuned Random Forest model.

Specifically, we gather values for Sensitivity, Specificity, and Balanced Accuracy for each salary class, round them for clarity, and organize the results into a readable table format.

library(dplyr)
library(knitr)
library(kableExtra)

rf_by_class <- as.data.frame(conf_mat_rf_tuned$byClass)
rf_by_class$Class <- rownames(conf_mat_rf_tuned$byClass)
rownames(rf_by_class) <- NULL

rf_by_class$Class <- gsub("Class: ", "", rf_by_class$Class)

rf_by_class$Balanced.Accuracy <- round((rf_by_class$Sensitivity + rf_by_class$Specificity) / 2, 3)

rf_class_metrics <- rf_by_class %>%
  dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3)))

kable(rf_class_metrics, caption = "Random Forest Tuned: Class-wise Performance Metrics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Random Forest Tuned: Class-wise Performance Metrics
Class Sensitivity Specificity Balanced.Accuracy
Rookie 0.917 0.889 0.903
Starter 0.583 0.944 0.764
Pro 0.692 0.886 0.789
MVP 0.818 0.946 0.882

Interpretation

Comparing all models’ metrics

To compare all models at a glance, we convert the Random Forest confusion matrix into a data frame, combine it with the confusion matrices of the earlier multinomial models (full, reduced, and vanilla), and plot everything as faceted heatmaps.

rf_cm_df <- as.data.frame(rf_conf_mat$table) %>%
  mutate(Model = "Random Forest")
combined_cm <- bind_rows(full_cm_df, reduced_cm_df, vanilla_cm_df, rf_cm_df)
library(ggplot2)

ggplot(combined_cm, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black") +
  facet_wrap(~ Model) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrices: All Models Compared",
       fill = "Frequency") +
  theme_minimal()

Interpretation

KNN

We chose to include K-Nearest Neighbors (KNN) in our model comparison because it’s a simple, intuitive, and non-parametric method that classifies observations based on the majority class among the closest training examples.

Unlike models that assume linear relationships (like logistic regression), KNN captures local patterns in the data, which can be especially useful in a multi-class setting like ours. It also helps us evaluate how well a distance-based approach performs compared to more complex ensemble models like Random Forest, offering valuable contrast in terms of both methodology and outcomes.

This code prepares for KNN modeling using the caret package. We set a random seed for reproducibility and configure 10-fold cross-validation to evaluate model performance. The settings are stored in knn_ctrl for later use.

library(caret)

set.seed(123)
knn_ctrl <- trainControl(method = "cv", number = 10)

Here we train our KNN model to predict salary levels (SalaryLevel) using all the other variables in our training data. We’re using the settings we made earlier for 10-fold cross-validation (knn_ctrl). The tuneLength = 10 part means the model will automatically try 10 different values for k (the number of neighbors) to find which works best. The trained model gets saved as knn_model so we can check how well it performs later.

knn_model <- train(SalaryLevel ~ .,
                   data = train_data,
                   method = "knn",
                   trControl = knn_ctrl,
                   tuneLength = 10)  
knn_model$bestTune
##     k
## 10 23
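One caveat worth flagging: k-NN is distance-based, so predictors measured on large scales (career totals, for instance) can dominate the distance metric. The pipeline above does not standardize the predictors; a variant that does (an assumption, illustrated here on the built-in iris data rather than our training set) adds caret’s preProcess step:

```r
library(caret)

set.seed(123)
# preProcess = c("center", "scale") standardizes the predictors inside each
# CV fold, so no information leaks from the held-out fold into the scaling.
ctrl <- trainControl(method = "cv", number = 5)
knn_scaled <- train(Species ~ ., data = iris,
                    method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = ctrl,
                    tuneLength = 5)
knn_scaled$bestTune
```

On the Hitters training data, the same `preProcess = c("center", "scale")` argument would slot straight into the `train()` call above.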

Interpretation

plot(knn_model)

Interpretation

The graph shows how cross-validated accuracy changes as the number of neighbors k varies. Very small k values fit local noise in the training data, and accuracy improves as k grows, peaking at k = 23, the value caret selects as bestTune.

Here we are making predictions on the salary level using KNN on the test data.

knn_preds <- predict(knn_model, newdata = test_data)

Here we are creating a confusion matrix with all the metrics of KNN’s performance.

conf_mat_knn <- confusionMatrix(knn_preds, test_data$SalaryLevel, mode = "everything")

Here we’re summarizing how the KNN model performed for each salary class by calculating sensitivity, specificity, and balanced accuracy. This helps us see how well the model is able to correctly identify each class, not just its overall accuracy.

library(dplyr)
library(knitr)
library(kableExtra)

knn_by_class <- as.data.frame(conf_mat_knn$byClass)
knn_by_class$Class <- rownames(conf_mat_knn$byClass)
rownames(knn_by_class) <- NULL

knn_by_class$Class <- gsub("Class: ", "", knn_by_class$Class)

knn_by_class$Balanced.Accuracy <- round((knn_by_class$Sensitivity +
                                       knn_by_class$Specificity) / 2, 3)

knn_class_metrics <- knn_by_class %>%  
    dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%  
    mutate(across(where(is.numeric), ~ round(.x, 3)))

kable(knn_class_metrics, caption = "KNN: Class-wise Performance Metrics") %>%  
    kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed"))
KNN: Class-wise Performance Metrics
Class Sensitivity Specificity Balanced.Accuracy
Rookie 0.833 0.917 0.875
Starter 0.333 0.889 0.611
Pro 0.538 0.657 0.598
MVP 0.455 0.919 0.687

Interpretation

Here we are displaying the confusion matrix for KNN in a clean format so we can easily see how many observations were correctly or incorrectly predicted for each salary level.

library(knitr)
library(kableExtra)
library(tidyr)

knn_table <- as.data.frame(conf_mat_knn$table)
knn_table_wide <- pivot_wider(knn_table, 
                            names_from = Reference, 
                            values_from = Freq)

kable(knn_table_wide, caption = "KNN: Confusion Matrix") %>%
    kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed"))
KNN: Confusion Matrix
Prediction Rookie Starter Pro MVP
Rookie 10 3 0 0
Starter 0 4 4 0
Pro 2 4 7 6
MVP 0 1 2 5

Interpretation

In this code chunk, we are building a combined table to compare the class-wise performance of three models: the tuned Random Forest, untuned Random Forest, and k-NN. For each model, we collect the key performance metrics — Sensitivity, Specificity, and Balanced Accuracy — and label them with the corresponding model name. These tables are then merged into one, which is printed using a styled table to visualize and directly compare how well each model performs across different salary classes.

library(dplyr)
library(kableExtra)

rf_class_metrics$Model <- "RF Tuned"
rf_class_metrics_not_tuned$Model <- "RF Untuned"
knn_class_metrics$Model <- "k-NN"

colnames(rf_class_metrics)[colnames(rf_class_metrics) == "Balanced.Accuracy"] <- "Balanced Accuracy"
colnames(rf_class_metrics_not_tuned)[colnames(rf_class_metrics_not_tuned) == "Balanced.Accuracy"] <- "Balanced Accuracy"
colnames(knn_class_metrics)[colnames(knn_class_metrics) == "Balanced.Accuracy"] <- "Balanced Accuracy"

cols_to_use <- c("Model", "Class", "Sensitivity", "Specificity", "Balanced Accuracy")

model_comparison <- bind_rows(
  rf_class_metrics[, cols_to_use],
  rf_class_metrics_not_tuned[, cols_to_use],
  knn_class_metrics[, cols_to_use]
)

kable(model_comparison, caption = "Class-wise Performance Comparison: RF Tuned vs RF Untuned vs k-NN") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Class-wise Performance Comparison: RF Tuned vs RF Untuned vs k-NN
Model Class Sensitivity Specificity Balanced Accuracy
RF Tuned Rookie 0.917 0.889 0.903
RF Tuned Starter 0.583 0.944 0.764
RF Tuned Pro 0.692 0.886 0.789
RF Tuned MVP 0.818 0.946 0.882
RF Untuned Rookie 0.917 0.889 0.903
RF Untuned Starter 0.583 0.917 0.750
RF Untuned Pro 0.538 0.857 0.698
RF Untuned MVP 0.727 0.919 0.823
k-NN Rookie 0.833 0.917 0.875
k-NN Starter 0.333 0.889 0.611
k-NN Pro 0.538 0.657 0.598
k-NN MVP 0.455 0.919 0.687

Interpretation

In this section, we are collecting and comparing the overall performance metrics—Accuracy and Cohen’s Kappa—for three different models: Random Forest Tuned, Random Forest Untuned, and k-Nearest Neighbors (k-NN). The values are extracted from each model’s confusion matrix and organized into a single summary table for easy comparison.

overall_metrics <- data.frame(
  Model = c("RF Tuned", "RF Untuned", "k-NN"),
  Accuracy = c(
    round(conf_mat_rf_tuned$overall["Accuracy"], 4),
    round(rf_conf_mat$overall["Accuracy"], 4),
    round(conf_mat_knn$overall["Accuracy"], 4)
  ),
  Kappa = c(
    round(conf_mat_rf_tuned$overall["Kappa"], 4),
    round(rf_conf_mat$overall["Kappa"], 4),
    round(conf_mat_knn$overall["Kappa"], 4)
  )
)

kable(overall_metrics, caption = "Overall Accuracy & Kappa: RF Tuned vs RF Untuned vs k-NN") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Overall Accuracy & Kappa: RF Tuned vs RF Untuned vs k-NN
Model Accuracy Kappa
RF Tuned 0.7500 0.6663
RF Untuned 0.6875 0.5831
k-NN 0.5417 0.3850
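The same numbers can also be shown graphically. A presentation sketch (assuming the overall_metrics data frame built above) reshapes them with tidyr and draws a grouped bar chart:

```r
library(ggplot2)
library(tidyr)

# Long format: one row per (Model, Metric) pair, for side-by-side bars.
overall_long <- pivot_longer(overall_metrics, cols = c(Accuracy, Kappa),
                             names_to = "Metric", values_to = "Value")

ggplot(overall_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Overall Accuracy and Kappa by Model", y = NULL) +
  theme_minimal()
```

The dodged bars make the gap between the tuned Random Forest and k-NN immediately visible on both metrics.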

Interpretation

♥ Conclusion ♥

In this project, we set out to predict player salary levels using the Hitters dataset by applying and comparing multiple classification models. After cleaning the data and preparing the variables, we explored multinomial logistic regression, Random Forest (both tuned and untuned), and k-nearest neighbors (k-NN). The focus wasn’t just on building models but also on understanding their performance using meaningful evaluation metrics like accuracy, Cohen’s kappa, and class-wise sensitivity, specificity, and balanced accuracy.

From our results, the tuned Random Forest consistently performed the best. It achieved the highest overall accuracy (75%) and Cohen’s kappa (0.6663), suggesting strong agreement between the predicted and actual salary classes. It also delivered strong class-wise performance, especially for the higher-level salary classes Pro and MVP. The untuned Random Forest still performed fairly well but was slightly behind the tuned version in all metrics. k-NN, on the other hand, showed the weakest performance, with an accuracy of roughly 54% and a much lower kappa score (0.3850), struggling in particular to distinguish the middle and higher salary classes.

Overall, this comparison highlights the value of tuning machine learning models and choosing algorithms that can handle class imbalance effectively. While simpler models like k-NN are easy to implement, they may not always perform well in more complex classification tasks. The Random Forest, especially when tuned, showed the best balance between accuracy and generalization, making it the most reliable model for this dataset.

Our group had some assistance from OpenAI for debugging and occasional help with understanding outputs or cleaning up explanations.